Joint Segmentation and Clustering in Text Corpuses
Authors
Abstract
In recent years, many private corporations and government organizations have digitized corpuses of legacy paper documents. Often, these organizations hope to take advantage of digital representations to transform costly manual tasks associated with paper archives into less-costly computer-assisted tasks. The most common approach toward automated information extraction is through inverted indexing systems that allow fast keyword searches. Keyword-based indexing, however, is ineffective for tasks that require information from higher-level contexts. To allow for more effective information extraction from digital corpuses, we propose combining two common document processing tasks, (i) clustering and (ii) segmentation, into one process to simultaneously segment documents within a corpus and assign each segment to a category. We have developed a generative probabilistic model to accomplish this task, which we call the Joint Segmentation and Clustering (JSC) model. From experiments measuring segmentation and clustering ability, we show that our model can accurately partition documents and assign meaningful categories to each partition. In addition, experiments tracking predictive perplexity show that our JSC model outperforms basic topic modeling approaches in terms of conciseness of the induced representation.
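The abstract reports predictive perplexity without defining it. As a point of reference only, the sketch below uses the standard definition: the exponential of the average negative log-likelihood per held-out token, where lower is better. The function name `perplexity` and the two toy probability lists are hypothetical illustrations, not the authors' evaluation code or data.

```python
import math


def perplexity(token_log_probs):
    """Perplexity of held-out tokens, given their natural-log probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    # Average negative log-likelihood per token.
    avg_neg_log_lik = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_neg_log_lik)


# Hypothetical per-token probabilities that two models assign to the same
# eight held-out tokens (made-up numbers, for illustration only).
uniform_model = [math.log(0.25)] * 8   # every token gets probability 0.25
sharper_model = [math.log(0.5)] * 8    # every token gets probability 0.5

if __name__ == "__main__":
    print(perplexity(uniform_model))
    print(perplexity(sharper_model))
```

A model that assigns higher probability to the held-out tokens yields a lower perplexity, which is the sense in which one model can be said to outperform another on this metric.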
Similar resources
A Joint Semantic Vector Representation Model for Text Clustering and Classification
Text clustering and classification are two main tasks of text mining. Feature selection plays the key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researches to use...
Using an Evolving Thematic Clustering in a Text Segmentation Process
The thematic text segmentation task consists in identifying the most important thematic breaks in a document in order to cut it into homogeneous passages. We propose in this paper an algorithm for linear text segmentation on general corpuses. It relies on an initial clustering of the sentences of the text. This preliminary partitioning provides a global view on the sentences relations existing ...
Image Segmentation: Type–2 Fuzzy Possibilistic C-Mean Clustering Approach
Image segmentation is an essential issue in image description and classification. Currently, in many real applications, segmentation is still mainly manual or strongly supervised by a human expert, which makes it irreproducible and deteriorating. Moreover, there are many uncertainties and vagueness in images, which crisp clustering and even Type-1 fuzzy clustering could not handle. Hence, Type-...
Combining Character-Based and Subsequence-Based Tagging for Chinese Word Segmentation
Chinese word segmentation is the initial step for Chinese information processing. The performance of Chinese word segmentation has been greatly improved by character-based approaches in recent years. This approach treats Chinese word segmentation as a character-word-position-tagging problem. With the help of powerful sequence tagging model, character-based method quickly rose as a mainstream tec...
SegGen: A Genetic Algorithm for Linear Text Segmentation
This paper describes SegGen, a new algorithm for linear text segmentation on general corpuses. It aims to segment texts into thematic homogeneous parts. Several existing methods have been used for this purpose, based on a sequential creation of boundaries. Here, we propose to consider boundaries simultaneously thanks to a genetic algorithm. SegGen uses two criteria: maximization of the internal...